4a¶

GSE84133 is a single-cell RNA-seq dataset of human pancreatic islets from multiple donors, processed using the 10X Genomics platform. The raw count matrices were downloaded from GEO. Non-gene annotation rows were removed, and sample metadata (like barcode and cluster assignments) was separated to retain only numeric gene expression data. QC metrics (total counts, number of genes, and mitochondrial gene percentage) were calculated to assess cell quality. Cells with low gene counts and genes expressed in very few cells were filtered out to remove noise. The data was then normalized to 10,000 counts per cell to account for sequencing depth differences and log-transformed to stabilize variance across genes. All samples were concatenated into a single dataset. Highly variable genes were identified to focus the analysis on the most informative genes. The data was scaled to zero mean and unit variance, and batch effects between samples were corrected to avoid technical biases. Principal Component Analysis (PCA) was performed for dimensionality reduction. A neighborhood graph was computed based on the PCA space to capture the structure of the data. Cells were clustered using the Leiden algorithm, and marker genes were identified for each cluster. and based on the expression of known marker genes, biological cell type labels were assigned to the clusters.¶

4b¶

Step | Key Tools Used |¶

Download raw count matrices | wget, |¶

Remove non-gene annotations | pandas |¶

Separate metadata | pandas |¶

Calculate QC metrics | Scanpy |¶

Filter low-quality cells/genes | Scanpy |¶

Normalize counts | Scanpy |¶

Log-transform normalized data | Scanpy |¶

Concatenate samples | Scanpy, AnnData |¶

Identify highly variable genes (HVGs) | Scanpy |¶

Scale data (zero mean, unit variance) | Scanpy |¶

Batch correction | Harmony |¶

Perform PCA | Scanpy |¶

Compute neighborhood graph | Scanpy |¶

Cluster cells (Leiden) | Scanpy |¶

Find marker genes | Scanpy |¶

Assign cell type annotations | Scanpy, known markers |¶

In [ ]:
# Install packages
In [161]:
import os
import requests
import pandas as pd
import numpy as np
import scanpy as sc
import subprocess
import anndata as ad
import matplotlib.pyplot as plt
import seaborn as sns
import harmonypy as hm
import warnings
import plotly.express as px

np.random.seed(42)

Download and Load dataframes¶

Load raw count matrices from GSM samples¶

In [105]:
os.makedirs("data", exist_ok=True)

gsm_ids = {
    "GSM2230757": "GSM2230757_human1_umifm_counts.csv.gz",
    "GSM2230758": "GSM2230758_human2_umifm_counts.csv.gz",
    "GSM2230759": "GSM2230759_human3_umifm_counts.csv.gz",
    "GSM2230760": "GSM2230760_human4_umifm_counts.csv.gz"
}

for gsm_id, filename in gsm_ids.items():
    subdir = gsm_id[:7] + "nnn"
    url = f"https://ftp.ncbi.nlm.nih.gov/geo/samples/{subdir}/{gsm_id}/suppl/{filename}"
    output_path = os.path.join("data", filename)

    if os.path.isfile(output_path):
        print(f"{filename} already exists, skipping.")
        continue

    print(f" Downloading {filename}...")
    try:
        subprocess.run(["wget", "-O", output_path, url], check=True)
        print(f"Downloaded {filename}")
    except subprocess.CalledProcessError as e:
        print(f"Failed to download {filename}: {e}")
GSM2230757_human1_umifm_counts.csv.gz already exists, skipping.
GSM2230758_human2_umifm_counts.csv.gz already exists, skipping.
GSM2230759_human3_umifm_counts.csv.gz already exists, skipping.
GSM2230760_human4_umifm_counts.csv.gz already exists, skipping.
In [107]:
dataframes = {}

for gsm_id, filename in gsm_ids.items():
    file_path = os.path.join("data", filename)
    try:
        df = pd.read_csv(file_path, index_col=0)
        dataframes[gsm_id] = df
        print(f" Loaded {filename} with shape {df.shape}")
    except Exception as e:
        print(f" Error loading {filename}: {e}")
 Loaded GSM2230757_human1_umifm_counts.csv.gz with shape (1937, 20127)
 Loaded GSM2230758_human2_umifm_counts.csv.gz with shape (1724, 20127)
 Loaded GSM2230759_human3_umifm_counts.csv.gz with shape (3605, 20127)
 Loaded GSM2230760_human4_umifm_counts.csv.gz with shape (1303, 20127)

Remove annotations¶

Separate out metadata(barcode, assigned_cluster, pk) so gene expression matrices are clean and numeric.¶

In [109]:
# Initialize dictionaries to store the columns separately
barcode_cluster_pk_dataframes = {}

# Loop through each GSM DataFrame in the dictionary
for gsm_id, df in dataframes.items():
    # Separate the barcode, assigned_cluster, and pk columns
    barcode_cluster_pk_data = df[['barcode', 'assigned_cluster', 'pk']]
    
    # Store this separated data in a new dictionary
    barcode_cluster_pk_dataframes[gsm_id] = barcode_cluster_pk_data
    
    # Remove the columns 'barcode', 'assigned_cluster', and 'pk' from the original dataframe
    df.drop(columns=['barcode', 'assigned_cluster', 'pk'], inplace=True)

    # Print shape of updated DataFrame
    print(f"{gsm_id}: {df.shape[0]} rows and {df.shape[1]} columns after removing specified columns.")

# The `barcode_cluster_pk_dataframes` dictionary now contains the separated columns for each GSM.
GSM2230757: 1937 rows and 20124 columns after removing specified columns.
GSM2230758: 1724 rows and 20124 columns after removing specified columns.
GSM2230759: 3605 rows and 20124 columns after removing specified columns.
GSM2230760: 1303 rows and 20124 columns after removing specified columns.

QC Metrics¶

In [111]:
adata_dict = {}

# Convert dataframes into AnnData objects and store them in adata_dict
for gsm_id, df in dataframes.items():
    adata = sc.AnnData(df)
    adata_dict[gsm_id] = adata

# Calculate QC metrics for each sample
for gsm_id, adata in adata_dict.items():
    # Identify mitochondrial genes (assuming "MT-" prefix for human data)
    mito_genes = adata.var_names.str.startswith('MT')
    
    # Add percentage of mitochondrial counts to the adata.obs
    adata.obs['pct_counts_mt'] = adata[:, mito_genes].X.sum(axis=1) / adata.X.sum(axis=1) * 100
    
    # Calculate other QC metrics (total counts, number of genes, etc.)
    sc.pp.calculate_qc_metrics(adata, inplace=True)
    print(f"Calculated QC metrics for {gsm_id}")

# Check the available GSM IDs in the data
print("Available GSM IDs in the data:", adata_dict.keys())

# Print the QC metrics for each sample
for gsm_id, adata in adata_dict.items():
    print(f"\nQC Metrics for {gsm_id}:")
    print(f"Total counts: {adata.obs['total_counts'].mean():.2f}")
    print(f"Number of genes: {adata.obs['n_genes_by_counts'].mean():.2f}")
    print(f"Percentage of mitochondrial counts: {adata.obs['pct_counts_mt'].mean():.2f}%")
Calculated QC metrics for GSM2230757
Calculated QC metrics for GSM2230758
Calculated QC metrics for GSM2230759
Calculated QC metrics for GSM2230760
Available GSM IDs in the data: dict_keys(['GSM2230757', 'GSM2230758', 'GSM2230759', 'GSM2230760'])

QC Metrics for GSM2230757:
Total counts: 5805.05
Number of genes: 1923.72
Percentage of mitochondrial counts: 0.27%

QC Metrics for GSM2230758:
Total counts: 5109.00
Number of genes: 1893.88
Percentage of mitochondrial counts: 0.22%

QC Metrics for GSM2230759:
Total counts: 5858.42
Number of genes: 1750.07
Percentage of mitochondrial counts: 0.30%

QC Metrics for GSM2230760:
Total counts: 6729.19
Number of genes: 2202.82
Percentage of mitochondrial counts: 0.23%
In [113]:
for gsm_id, adata in adata_dict.items():
    print(f"\nGenerating QC plots for {gsm_id}")
    
    # Violin plots
    sc.pl.violin(
        adata,
        ['total_counts', 'n_genes_by_counts', 'pct_counts_mt'],
        jitter=0.4,
        multi_panel=True
    )
    plt.suptitle(f"{gsm_id} - Violin plots", fontsize=14)
    plt.tight_layout()
    plt.show()
Generating QC plots for GSM2230757
No description has been provided for this image
<Figure size 640x480 with 0 Axes>
Generating QC plots for GSM2230758
No description has been provided for this image
<Figure size 640x480 with 0 Axes>
Generating QC plots for GSM2230759
No description has been provided for this image
<Figure size 640x480 with 0 Axes>
Generating QC plots for GSM2230760
No description has been provided for this image
<Figure size 640x480 with 0 Axes>
In [275]:
print(adata_dict.keys())
dict_keys(['GSM2230757', 'GSM2230758', 'GSM2230759', 'GSM2230760'])

Filter out low-quality cells¶

Based on QC plots, I filtered cells expressing fewer than 200 genes, removed genes detected in fewer than 3 cells, and excluded cells with total UMI counts below 1,000 or above 20,000. Cells with >5% mitochondrial gene expression were removed to ensure high data quality for clustering and annotation.¶

In [115]:
# Thresholds
min_genes_per_cell = 200
min_cells_per_gene = 3
min_umi_per_cell = 1000
max_umi_per_cell = 20000
max_mito_percent = 5.0  # in %

# Create a new dictionary to hold filtered data
filtered_dataframes = {}

for gsm_id, df in dataframes.items():
    print(f"\nFiltering {gsm_id}...")

    # Calculate QC metrics
    total_counts = df.sum(axis=1)
    gene_counts = (df > 0).sum(axis=1)
    
    # Identify mitochondrial genes
    mito_genes = [gene for gene in df.columns if gene.startswith('MT')]
    mito_counts = df[mito_genes].sum(axis=1)
    mito_percent = (mito_counts / total_counts) * 100

    # Apply all filters
    cells_pass_filter = (
        (gene_counts >= min_genes_per_cell) &
        (total_counts >= min_umi_per_cell) &
        (total_counts <= max_umi_per_cell) &
        (mito_percent <= max_mito_percent)
    )
    df_filtered_cells = df[cells_pass_filter]
    print(f"  Cells retained after QC: {df_filtered_cells.shape[0]}")

    # Filter genes (columns) expressed in fewer than min_cells_per_gene cells
    genes_pass_filter = (df_filtered_cells > 0).sum(axis=0) >= min_cells_per_gene
    df_filtered = df_filtered_cells.loc[:, genes_pass_filter]
    print(f"  Genes retained after QC: {df_filtered.shape[1]}")

    # Store the filtered dataframe
    filtered_dataframes[gsm_id] = df_filtered

# Now filter annotations to match
filtered_annotations = {}

for gsm_id in filtered_dataframes:
    filtered_cells = filtered_dataframes[gsm_id].index
    ann_df = barcode_cluster_pk_dataframes[gsm_id]
    filtered_ann_df = ann_df.loc[filtered_cells]
    filtered_annotations[gsm_id] = filtered_ann_df

    print(f"{gsm_id}: Filtered annotations shape: {filtered_ann_df.shape}")
Filtering GSM2230757...
  Cells retained after QC: 1922
  Genes retained after QC: 14713

Filtering GSM2230758...
  Cells retained after QC: 1724
  Genes retained after QC: 14883

Filtering GSM2230759...
  Cells retained after QC: 3563
  Genes retained after QC: 15202

Filtering GSM2230760...
  Cells retained after QC: 1290
  Genes retained after QC: 14444
GSM2230757: Filtered annotations shape: (1922, 3)
GSM2230758: Filtered annotations shape: (1724, 3)
GSM2230759: Filtered annotations shape: (3563, 3)
GSM2230760: Filtered annotations shape: (1290, 3)

Normalize the data¶

Removes library size bias across cells. (per cell to 10,000 counts)¶

In [117]:
normalized_data = {}

for gsm_id, df in filtered_dataframes.items():
    # Create AnnData object
    adata = ad.AnnData(df)

    # Normalize: scale each cell's total counts to 10,000
    sc.pp.normalize_total(adata, target_sum=1e4)

    # Store normalized AnnData (not yet log-transformed)
    normalized_data[gsm_id] = adata
In [119]:
print(f"Number of negative entries: {(adata.X < 0).sum()}")
Number of negative entries: 0
In [121]:
sc.pp.normalize_total(adata, target_sum=1e4)
print(f"Number of negative entries: {(adata.X < 0).sum()}")
Number of negative entries: 0

Log-transform the normalized data (log1p)¶

Stabilizes variance, improves comparability across genes/cells.¶

In [123]:
log_transformed_data = {}

for gsm_id, adata in normalized_data.items():
    # Log1p transformation (no plots)
    sc.pp.log1p(adata)

    # Store log-transformed data
    log_transformed_data[gsm_id] = adata

Concatenate the dataframes¶

Combine all samples into one AnnData or matrix, now comparable due to shared scale.¶

In [125]:
# Concatenate all log-transformed AnnData objects
combined_adata = sc.concat(list(log_transformed_data.values()), 
                           label='batch', 
                           keys=list(log_transformed_data.keys()),
                           index_unique='-')

# Check the combined object
print(combined_adata)
AnnData object with n_obs × n_vars = 8499 × 13774
    obs: 'batch'
In [127]:
print(combined_adata.shape)  # cells x genes
print(combined_adata.obs['batch'].value_counts())  # how many cells per batch
(8499, 13774)
batch
GSM2230759    3563
GSM2230757    1922
GSM2230758    1724
GSM2230760    1290
Name: count, dtype: int64

Identify highly variable genes (HVGs)¶

Focus on the most distinguishing genes for downstream dimensionality reduction.¶

In [129]:
sc.pp.highly_variable_genes(combined_adata, 
                             flavor='seurat', 
                             n_top_genes=2000)

# Print summary
print(f"Number of highly variable genes selected: {combined_adata.var['highly_variable'].sum()}")


sc.pl.highly_variable_genes(combined_adata)
Number of highly variable genes selected: 2000
No description has been provided for this image

Scale the data (zero-mean, unit-variance)¶

Essential for PCA and clustering to perform well.¶

In [131]:
sc.pp.scale(combined_adata, zero_center=True, max_value=None)

# Check the scaled data
print(combined_adata)
AnnData object with n_obs × n_vars = 8499 × 13774
    obs: 'batch'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg'

Perform PCA¶

Reduces noise and dimensionality before batch correction and UMAP.¶

In [133]:
sc.tl.pca(combined_adata, svd_solver='arpack', random_state= 42 )

# Plot the PCA results (2D)
sc.pl.pca(combined_adata, color='batch') 

# Show the explained variance ratio to understand the amount of variance captured by each principal component
print(combined_adata.uns['pca']['variance_ratio'])
No description has been provided for this image
[0.06281348 0.04058672 0.0271614  0.0235323  0.02264049 0.01322893
 0.00852874 0.00778488 0.00736579 0.00519212 0.00502104 0.00461948
 0.00422595 0.00370149 0.00320128 0.00308771 0.00270745 0.0026192
 0.00260219 0.00252267 0.00238146 0.00231575 0.00230704 0.00217362
 0.00211461 0.00203691 0.00202781 0.00197115 0.00186409 0.00178382
 0.00176703 0.00171424 0.00167931 0.00162129 0.00161178 0.00160302
 0.00153201 0.00150662 0.0014808  0.00146569 0.00144795 0.00143309
 0.00140103 0.00139155 0.00137676 0.00135418 0.00133402 0.00132486
 0.00132042 0.00131809]

Batch correction using Harmony¶

Removes technical batch effects due to different samples/libraries.¶

In [135]:
sc.pp.neighbors(combined_adata, use_rep='X_pca', random_state= 42)
sc.tl.umap(combined_adata,random_state= 42)

# Save the UMAP coordinates before batch correction
combined_adata.obsm['X_umap_orig'] = combined_adata.obsm['X_umap'].copy()
In [137]:
# Run Harmony on the PCA space
ho = hm.run_harmony(
    combined_adata.obsm['X_pca'],  # PCA matrix
    combined_adata.obs,            # Metadata (the obs table)
    vars_use=['batch']             # Column name(s) in obs that define batches
)

# Save the corrected PCA embeddings into a new slot
combined_adata.obsm['X_pca_harmony'] = ho.Z_corr.T  # Harmony returns PCs x cells; transpose!

# Recompute neighbors on Harmony-corrected PCA
sc.pp.neighbors(combined_adata, use_rep='X_pca_harmony', random_state= 42)  # <- very important!

# Now compute UMAP
sc.tl.umap(combined_adata, random_state= 42)

# Save the UMAP coordinates
combined_adata.obsm['X_umap_harmony'] = combined_adata.obsm['X_umap'].copy()

# Plot side-by-side comparison for before and after correctn
fig, axs = plt.subplots(1, 2, figsize=(14, 6))

# Plot original UMAP (before Harmony)
sc.pl.embedding(
    combined_adata, 
    basis='X_umap_orig', 
    color='batch', 
    title='Before Batch Correction', 
    ax=axs[0], 
    show=False
)

# Plot Harmony-corrected UMAP
sc.pl.embedding(
    combined_adata, 
    basis='X_umap_harmony', 
    color='batch', 
    title='After Harmony Correction', 
    ax=axs[1], 
    show=False
)

plt.tight_layout()
plt.show()
2025-05-03 08:58:26,280 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
2025-05-03 08:58:27,049 - harmonypy - INFO - sklearn.KMeans initialization complete.
2025-05-03 08:58:27,066 - harmonypy - INFO - Iteration 1 of 10
2025-05-03 08:58:27,741 - harmonypy - INFO - Iteration 2 of 10
2025-05-03 08:58:28,395 - harmonypy - INFO - Iteration 3 of 10
2025-05-03 08:58:29,058 - harmonypy - INFO - Iteration 4 of 10
2025-05-03 08:58:29,700 - harmonypy - INFO - Iteration 5 of 10
2025-05-03 08:58:30,366 - harmonypy - INFO - Iteration 6 of 10
2025-05-03 08:58:30,993 - harmonypy - INFO - Iteration 7 of 10
2025-05-03 08:58:31,360 - harmonypy - INFO - Iteration 8 of 10
2025-05-03 08:58:31,741 - harmonypy - INFO - Converged after 8 iterations
No description has been provided for this image

Compute neighborhood graph (on batch-corrected PCA)¶

For clustering and manifold learning (UMAP).¶

In [141]:
sc.tl.leiden(combined_adata, resolution=0.5)

Cluster cells (Leiden)¶

Group cells based on shared transcriptomic profile.¶

In [143]:
sc.pl.umap(combined_adata, color='leiden', title='Clusters after Harmony batch correction', legend_loc='on data')

sc.pp.neighbors(
    combined_adata, 
    use_rep='X_pca_harmony'  # use the Harmony-corrected PCs
)
No description has been provided for this image

Visualize using UMAP¶

Project high-dimensional data¶

In [145]:
# 1. Re-run UMAP asking for 3 components
sc.tl.umap(
    combined_adata, 
    n_components=3
)

# 2. Plot 3D UMAP
sc.pl.umap(
    combined_adata, 
    color=['batch', 'leiden'],  # You can color by batch or cluster
    title='3D UMAP after Harmony batch correction',
    projection='3d'
)
WARNING: The title list is shorter than the number of panels. Using 'color' value instead for some plots.
No description has been provided for this image

Annotate the cells¶

Find Marker Genes¶

In [147]:
# Define pancreas cell markers
marker_genes = {
    'Alpha': ['GCG', 'ARX'],
    'Beta': ['INS', 'MAFA', 'NKX6-1'],
    'Delta': ['SST', 'HHEX'],
    'PP': ['PPY'],
    'Acinar': ['PRSS1', 'CPA1', 'AMY2A'],
    'Ductal': ['KRT19', 'CFTR', 'SOX9'],
    'Endothelial': ['PECAM1', 'VWF'],
    'Macrophage': ['CD68', 'LYZ'],
    'Stellate': ['COL1A1', 'PDGFRB']
}
In [75]:
sns.set(style="whitegrid")

valid_genes = {cell_type: [gene for gene in genes if gene in combined_adata.var_names] 
               for cell_type, genes in marker_genes.items()}

for cell_type, genes in valid_genes.items():
    if genes:
        sc.tl.score_genes(combined_adata, gene_list=genes, score_name=f'{cell_type}_score', use_raw=False)

score_cols = [f'{ct}_score' for ct in valid_genes.keys()]
combined_adata.obs['cell_type'] = combined_adata.obs[score_cols].idxmax(axis=1)
combined_adata.obs['cell_type'] = combined_adata.obs['cell_type'].str.replace('_score', '')

if 'X_umap' not in combined_adata.obsm:
    print("UMAP embeddings not found.")
else:
    print("UMAP embeddings found.")

sc.pl.umap(combined_adata, 
           color='cell_type', 
           palette='tab10', 
           title='UMAP with Cell Types',
           legend_loc=None,  
           show=False)

plt.grid(True, linestyle='-', linewidth=0.5, alpha=0.6)
plt.title('UMAP with Cell Types')

for i, cell_type in enumerate(combined_adata.obs['cell_type'].unique()):
    subset = combined_adata[combined_adata.obs['cell_type'] == cell_type]
    plt.text(subset.obsm['X_umap'][:, 0].mean(), 
             subset.obsm['X_umap'][:, 1].mean(), 
             cell_type, 
             horizontalalignment='center', 
             verticalalignment='center', 
             fontsize=12, 
             weight='bold')

plt.show()
UMAP embeddings found.
No description has been provided for this image
In [151]:
import plotly.express as px


sns.set(style="whitegrid")

# Score and assign cell types
valid_genes = {cell_type: [gene for gene in genes if gene in combined_adata.var_names] 
               for cell_type, genes in marker_genes.items()}

for cell_type, genes in valid_genes.items():
    if genes:
        sc.tl.score_genes(combined_adata, gene_list=genes, score_name=f'{cell_type}_score', use_raw=False)

score_cols = [f'{ct}_score' for ct in valid_genes.keys()]
combined_adata.obs['cell_type'] = combined_adata.obs[score_cols].idxmax(axis=1)
combined_adata.obs['cell_type'] = combined_adata.obs['cell_type'].str.replace('_score', '')

# Check UMAP
if 'X_umap' not in combined_adata.obsm:
    print("UMAP embeddings not found.")
else:
    print("UMAP embeddings found.")

# Run UMAP again if needed
if combined_adata.obsm['X_umap'].shape[1] < 3:
    print("Warning: UMAP does not have 3 dimensions. Running 3D UMAP...")
    sc.tl.umap(combined_adata, n_components=3)

# Create a DataFrame for plotting
umap_df = pd.DataFrame(
    combined_adata.obsm['X_umap'],
    columns=['UMAP1', 'UMAP2', 'UMAP3'],
    index=combined_adata.obs.index
)
umap_df['cell_type'] = combined_adata.obs['cell_type'].values

# Interactive 3D plot
fig = px.scatter_3d(
    umap_df, 
    x='UMAP1', y='UMAP2', z='UMAP3',
    color='cell_type',
    title='Interactive 3D UMAP with Cell Types',
    opacity=0.7,
    height=800,
    width=900
)

fig.update_layout(
    legend=dict(itemsizing='constant'),
    margin=dict(l=0, r=0, b=0, t=40),
    scene=dict(
        xaxis_title='UMAP1',
        yaxis_title='UMAP2',
        zaxis_title='UMAP3'
    )
)

fig.show()
UMAP embeddings found.
In [163]:
warnings.filterwarnings("ignore")

# Find marker genes per cell type using Wilcoxon test
sc.tl.rank_genes_groups(
    combined_adata,
    groupby='cell_type',     
    method='wilcoxon',       
    n_genes=20              
)

# top 5 genes per cell type
result = combined_adata.uns['rank_genes_groups']
groups = result['names'].dtype.names  # the different cell types

top_markers = []
for group in groups:
    top_markers.extend(result['names'][group][:5])  # top 5 markers per cell type

top_markers = list(set(top_markers))  

# 6. Plot a DotPlot for top 5 marker genes
import seaborn as sns

sns.set_context('notebook', font_scale=1.5)  
sns.set_palette('coolwarm')  

# Create the DotPlot
fig, ax = plt.subplots(figsize=(14, 6))
sc.pl.dotplot(
    combined_adata,
    var_names=top_markers,
    groupby='cell_type',
    standard_scale='var',    
    dot_max=0.7,              
    color_map='coolwarm',     
    ax=ax,                    
    dendrogram=True           
)


ax.set_title('Top 5 Marker Genes per Cell Type', fontsize=18)
plt.tight_layout() 
plt.show()
No description has been provided for this image
<Figure size 640x480 with 0 Axes>
In [155]:
barcode_cluster_pk_dataframes
Out[155]:
{'GSM2230757':                                           barcode    assigned_cluster  pk
 human1_lib1.final_cell_0001   GATGACGGAC-GGTGGGAT              acinar   1
 human1_lib1.final_cell_0002   GAGCGTTGCT-ACCTTCTT              acinar   0
 human1_lib1.final_cell_0003     CTTACGGG-CCATTACT              acinar   0
 human1_lib1.final_cell_0004   GATGTACACG-TTAAACTG              acinar   0
 human1_lib1.final_cell_0005   GAGATTGCGA-GTCGTCGT              acinar   1
 ...                                           ...                 ...  ..
 human1_lib3.final_cell_0736   GAGAGAGTAT-GATTTACC         endothelial   0
 human1_lib3.final_cell_0737  TGATTCGCTGG-CTTCTGGA                beta   0
 human1_lib3.final_cell_0738     GCTTACCT-GGCATGCT         endothelial   0
 human1_lib3.final_cell_0739     CGGCACAT-TGGCCTGT                beta   0
 human1_lib3.final_cell_0740     TGCCTCAC-ACATCTAT  quiescent_stellate   0
 
 [1937 rows x 3 columns],
 'GSM2230758':                                          barcode    assigned_cluster  pk
 human2_lib1.final_cell_0001    TCCAGGGA-CTGGTGCA               delta   0
 human2_lib1.final_cell_0002  GAAAGATTGT-TCCGTCCA               delta   0
 human2_lib1.final_cell_0003    AAATCAGA-AGTGATGC               alpha   0
 human2_lib1.final_cell_0004   ATAGTGGAC-GAGAGTAT               delta   0
 human2_lib1.final_cell_0005  GACTTACTCC-AAAGCCTA               delta   0
 ...                                          ...                 ...  ..
 human2_lib3.final_cell_0561    CATCGCAG-GCAACCTG  activated_stellate   0
 human2_lib3.final_cell_0562  GAGGACTTCC-CGGACAAC          macrophage   0
 human2_lib3.final_cell_0563    GTAACGTT-ATTCCTTG              ductal   0
 human2_lib3.final_cell_0564    GTGTAACC-TGTCTTTC  activated_stellate   0
 human2_lib3.final_cell_0565  GACCTACTAG-GCTTTGGC               alpha   0
 
 [1724 rows x 3 columns],
 'GSM2230759':                                           barcode assigned_cluster  pk
 human3_lib1.final_cell_0001  TGAAAACTGGT-ATGTTGGC           acinar   0
 human3_lib1.final_cell_0002    ACGGAATTT-GATCGTTT           acinar   2
 human3_lib1.final_cell_0003  TGAGGCGGTTT-AAAGCCTA           acinar   2
 human3_lib1.final_cell_0004     ACGTATAC-TGATAACA           acinar   1
 human3_lib1.final_cell_0005    AAATGAATG-GAAAGACC           acinar   2
 ...                                           ...              ...  ..
 human3_lib4.final_cell_0895  TGACTAGTAAC-TTAAACTG            alpha   0
 human3_lib4.final_cell_0896     TGCTATTT-AAACTCGA             beta   0
 human3_lib4.final_cell_0897     GCTTACCT-GGCCTAAG            alpha   0
 human3_lib4.final_cell_0898   GACCCATAGC-CTCCGCAT             beta   0
 human3_lib4.final_cell_0899   GAACTGCCGT-TCTCACTT            alpha   0
 
 [3605 rows x 3 columns],
 'GSM2230760':                                           barcode    assigned_cluster  pk
 human4_lib1.final_cell_0001    AATATCTTC-AGTGAAAG              ductal   0
 human4_lib1.final_cell_0002     AGGCAACG-GCATGGGT               delta   0
 human4_lib1.final_cell_0003    AACGCAGAG-TTGTCGCC               delta   0
 human4_lib1.final_cell_0004     CGGCTTAC-CGGGCTTT              ductal   0
 human4_lib1.final_cell_0005    AAGCTACGG-TGTAGTTT              ductal   1
 ...                                           ...                 ...  ..
 human4_lib3.final_cell_0697   GAGATCTCGG-GTCTCTCT  activated_stellate   0
 human4_lib3.final_cell_0698     GCTTACCT-ATGTTGGC               alpha   0
 human4_lib3.final_cell_0699  TGACACAGTTT-TTGTCGCC                beta   0
 human4_lib3.final_cell_0700   GACGACTCCT-CGCTAATA                beta   0
 human4_lib3.final_cell_0701     TGATGCCC-TTGCACGC              ductal   0
 
 [1303 rows x 3 columns]}
In [157]:
combined_adata
Out[157]:
AnnData object with n_obs × n_vars = 8499 × 13774
    obs: 'batch', 'leiden', 'Alpha_score', 'Beta_score', 'Delta_score', 'PP_score', 'Acinar_score', 'Ductal_score', 'Endothelial_score', 'Macrophage_score', 'Stellate_score', 'cell_type'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'pca', 'batch_colors', 'neighbors', 'umap', 'leiden', 'leiden_colors', 'rank_genes_groups', 'dendrogram_cell_type'
    obsm: 'X_pca', 'X_umap', 'X_umap_orig', 'X_pca_harmony', 'X_umap_harmony'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
In [159]:
combined_adata.write('combined_adata.h5ad')